Project presentation

Ander Barrio Campos(231938), Dionysios Dimitreas(s232752), Erikas Mikužis(s223164), Valeria Tedeschi(s231945), Angeliki Vliora(233059)

Introduction

Tidyverse: enhances data manipulation and visualization with a tidy data workflow, fostering code that is

  • readable
  • maintainable
  • reproducible
  • Core packages ggplot2, dplyr, tidyr, readr, broom

Our Dataset

Source: Behavioral Risk Factor Surveillance System (BRFSS) 2015.

Key Features: Health indicators related to diabetes, including:

  • lifestyle factors
  • health outcomes
  • demographic information

:::

Introduction

Research Questions

  1. What are the key predictive variables in diabetes prognosis?

  2. How does gender influence the manifestation and progression of diabetes?

Materials and Methods

Data Cleaning and Augmentation

Data Cleaning

  • Removed Missing Values: df_cleaned <- df |> drop_na()

  • Verified Data Types: column_types <- summarise(df_cleaned, across(everything(), class))

  • Filtered Incorrect Values: Filtered out rows with values outside expected ranges.

Data Augmentation

  • Transformed Variables: Binary to categorical (e.g., Smoker to Smoking Status).

  • Created New Variables: E.g., Habits, Health Risk, based on lifestyle and health indicators.

  • Socio-Economic Class: Derived from income, education, and healthcare status.

Data Analysis

Correlation

  • Purpose: Check the relationship among numerical variables.

  • Two types: Between all the variables and only with the target variable (Diabetes_binary).

GLM

  • All variables: Creation of a GLM with all numerical variables.

  • Step: Step forward and backward for best variables selection.

  • Evaluation: Selection of lowest AIC model and analyse the selected variables.

Data Analysis

PCA + Logistic Regression

  • Purpose: Decrease number of variables as we had plenty of them.

  • Logistic regression: Use of those components to perform a diabetes prediction model.

  • Evaluation: Confusion matrix and accuracy.

New GLMs

  • Purpose: Check gender influence in the diagnosis of diabetes and tackle gender bias in science.

  • New datasets: Creation of two different datasets according to sex.

  • Evaluation: Compare the selected variables for each model and selection of the best model according to AIC.

Results

  • Smoking status affects mostly the youngest age groups
  • Similar behavior for other age groups
  • BMI tends to increase over the age

  • Among Healthy individuals, women have healthier habits
  • More non-diabetic women have an Average lifestyle than men, opposite for the diabetic counterparts.

Results

Key Findings:

  • Correlation Analysis:
    • Health-related variables positively correlated with diabetes.
    • Higher positive correlation between PhysHlth and GenHlth.
    • Negative correlation between GenHlth and Income.
  • GLM Results:
    • GenHlth, HighBP, and BMI are top predictors for diabetes.
    • Backward selection model excludes Smoker, AnyHealthcare, and NoDocbcCost.
    • Stronger health indicators suggest lower diabetes likelihood.

Results

Key Findings:

  • PCA Analysis:
    • Clear distinction in diabetes status across principal components.
    • Top components capture significant variance in health-related variables.
  • Logistic Regression Model:
    • High accuracy in predicting diabetes using principal components.
    • Confirms the strong link between health indicators and diabetes.
  • Gender-Based GLM Analysis:
    • Men’s and women’s data show different predictive variables’ importance.
    • Suggests gender-specific approaches might be more effective in diabetes prediction.

Discussion

Concluding Thoughts

  • Key Predictive Variables in Diabetes Prognosis:
    • GenHlth, HighBP, and BMI emerged as significant predictors.
    • Highlights the interplay of overall health and specific medical indicators in diabetes risk.
  • Gender’s Influence on Diabetes:
    • Different predictive variables for men and women indicate a gender-specific impact.
    • Suggests tailored approaches in diabetes care based on gender.